WeWork's good enough order guarantee

In this talk from RabbitMQ Summit 2019 we listen to Ilana Sufrin and Avikam Agur from WeWork.

Ilana and Avikam figured out how to rearchitect their app's RabbitMQ pipeline in order to save WeWork's developers time and money. They are here to show you a strategy to guarantee message serialization when order matters, and to convince you that sometimes the best solution is just "good enough".

Short biography
Ilana Sufrin (Twitter, GitHub) studied sociology at NYU and began her career as an editor at the Huffington Post. She quickly realized that she'd rather be programming, and she went back to NYU for a Master's in Computer Science to facilitate her transition into tech. Most recently, Ilana has been a software engineer at Refinery29 and WeWork, where at any given time she's probably playing with an office dog. Ilana is a Pittsburgh native and has been living in NYC for ten years.

Avikam Agur (GitHub) is a former Software Engineer at Microsoft Israel who worked on various Azure cloud products. Later, he was the first engineer at Pagaya, a successful fintech company based in Tel Aviv that specializes in peer-to-peer investment. During that time, he designed and implemented the data pipeline, powered by RabbitMQ, that feeds the database the company uses to strategize its investments.

After leaving Pagaya, Avikam joined WeWork in NYC, on the team that manages WeWork memberships.

WeWork's "good enough" order guarantee

I'm Ilana and this is my co-worker Avikam, and we are software engineers at WeWork. We will take you through a real-life architecture review of how we solved a real problem we had with the way we were previously using our RabbitMQ stack at WeWork.

The Problem

We are software engineers on the Memberships team at WeWork and manage "everything membership". This involves getting a membership that enables you to book offices or any other type of space, adding people to your membership so they have access to workspace, and anything similar. We have a microservice that keeps track of all these membership states, and different services publish events that our service listens to.

Right away, you can see where this may go wrong, especially with how we used to think about organizing all these events. We used to think about making everything asynchronous; obviously that's the way to go, right? But what if we get an event to add one office to a company, and afterwards another event telling us to add two more people to that office? Adding two people is a very quick operation that might complete before adding the office if the two events were picked up by two asynchronous workers. This actually happened quite often, leaving accounts in incorrect states, and an engineer had to go in and fix them manually.

This was taking quite a lot of developer time: staring at messages to understand where they got out of sync. It was very manual and quite annoying. A little disclaimer here: this is all real, and we're going to show you the real architecture diagrams that we made in order to try to solve the problem. So, if you see the word Pegasus, that is our Memberships API, which is the main system we're going to be talking about. If you see the name Spaceman, that is our Reservations Service. We're not hiding anything from you; you're getting the raw slides.

Code Evolution

The code shown here is an excerpt that demonstrates how our code evolved toward separating the consumption and production of these messages. In the beginning, everything was processed in one place: the space service would handle all of these things itself in one long flow. We later transitioned to a microservices architecture, where we would call into our services directly, so all of these things were still coupled. We got the synchronization for free simply because everything was processed in such a tightly coupled way.

As we evolved, we separated all of these tasks out into RabbitMQ and decided to process everything asynchronously. As we evolved even further, the messages became smaller, with smaller responsibilities, like just adding a space or just adding a person. This is a good thing, because that kind of code is easier to maintain. But when things actually need to be coupled, the downsides of these wins start to surface, and that is what gave birth to the problem we just described.

The ideas we present next are how we set out to solve this problem. One important thing to keep in mind is that we really wanted to solve it from within our own domain, without transitioning back to making everything very tightly coupled: performing these operations synchronously would defeat the purpose of having things separated. Besides, we cannot really control the other services.

First try: Lock

So, what is the simplest thing we could think of to try to solve our asynchrony problem? We came up with the idea of processing only one message at a time. Clearly, our events cannot get out of order if we're only doing one thing at a time. This is a real diagram that we made in order to work through this solution:

We would reuse Redis as a distributed lock. The idea was that we would basically push a key into Redis saying, yes, we're busy right now. Then, if a worker came in to do some work, it would first check whether any other account had work in progress, and if so it would wait. We rejected this idea for a few reasons. First, it really doesn't scale at all, as we can process only one message at a time. Also, it has all the problems inherent to distributed locks.
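
To make this concrete, here is a minimal sketch, not WeWork's actual code, of what a global Redis lock around message processing could look like with the redis-py client; the key name, TTL, and handle_message callback are illustrative assumptions:

```python
import time
import redis

r = redis.Redis()

GLOBAL_LOCK_KEY = "membership:global-lock"  # hypothetical key name
LOCK_TTL_SECONDS = 30                       # expires so a dead worker cannot hold it forever


def process_with_global_lock(message, handle_message):
    """Wait until no other worker is busy, then process the message."""
    while True:
        # SET with nx=True only succeeds if the key does not already exist,
        # i.e. no other worker currently holds the lock.
        if r.set(GLOBAL_LOCK_KEY, "busy", nx=True, ex=LOCK_TTL_SECONDS):
            break
        time.sleep(0.1)  # someone else is working; wait and retry
    try:
        handle_message(message)
    finally:
        r.delete(GLOBAL_LOCK_KEY)  # release the lock
```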

So, what if a worker came in, took the lock, and died before releasing it? Obviously, that's a solved problem, and it happens all the time, but we are just two software engineers on a business logic team, so we really did not want to deal with it ourselves. For these reasons, we decided to reject this idea.

Try #2: A more complicated distributed lock

Our second solution is a more complicated version of the first. Instead of processing only one message at a time, we would now lock individual accounts rather than the whole system. We would push account UUIDs into Redis, and if any work came in for a particular account, we would lock that account before allowing any more work to be done on it.
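
A similarly hedged sketch of the per-account variant, again with an invented key scheme and handler; only the lock key changes, from one global key to one key per account UUID:

```python
import time
import redis

r = redis.Redis()
LOCK_TTL_SECONDS = 30  # illustrative value


def process_with_account_lock(account_uuid, message, handle_message):
    """Only one worker may process a given account at a time; other accounts run in parallel."""
    lock_key = f"membership:lock:{account_uuid}"  # hypothetical key naming
    while True:
        if r.set(lock_key, "busy", nx=True, ex=LOCK_TTL_SECONDS):
            break
        time.sleep(0.1)  # this account is busy; wait and retry
    try:
        handle_message(message)
    finally:
        r.delete(lock_key)
```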

This actually scaled way better than the first solution we presented. However, it still has all the problems inherent to distributed locking, and we really didn't want to build all of that ourselves.

Try #3: Transient Queue per Account

So, to get rid of the lock, we came to a realization: we were willing to process one message at a time for a single account, but we really didn't want to give up processing multiple accounts at the same time.

So, the third idea we talked about was having a queue designated specifically for each account. When a message came in from whatever publisher, our membership service would work out which account it belongs to and then republish that message to a queue that has only one consumer for that account.

However, this also raised some concerns. The most obvious is that we could potentially end up with an unbounded number of queues, since we cannot control how many accounts we have or which accounts messages are sent to.

We thought about solving that by removing the queues that were empty, essentially implementing a sort of transient queue. But that is itself a concurrency issue: how can you guarantee that no new message has landed in a queue before you decide to delete it? It's not an atomic operation.
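
As a rough illustration only, a transient per-account queue could be approximated with pika's auto-delete queues; the queue naming is invented, and this is not how WeWork implemented it:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()


def republish_to_account_queue(account_uuid: str, body: bytes):
    # One queue per account, created on demand. auto_delete tells RabbitMQ to drop the
    # queue when its last consumer goes away; deleting queues yourself whenever they look
    # empty runs into exactly the race condition described above.
    queue_name = f"membership.account.{account_uuid}"  # hypothetical naming scheme
    channel.queue_declare(queue=queue_name, auto_delete=True)
    channel.basic_publish(exchange="", routing_key=queue_name, body=body)
```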

How did we solve the problem?

We liked that the third solution did not have locks, was pretty scalable, and could process multiple accounts at the same time. The only real problem was that it wasn't simple at all.

We thought we could do better. We were only getting about two requests per second; we did not need a very complicated solution to solve our problem. So we thought: what if we modified it so that, instead of an arbitrary number of queues, we had a fixed number of queues? This brought us to the solution that we actually ended up going with.

Our solution

In order to get rid of these transient queues popping up and going away, we decided to have a fixed number of queues and then decide which queue to republish a message to by using a hash function on the account. This guarantees that messages for the same account always go to the same queue, while we can still process multiple queues at the same time.

For example, a message comes in from the reservation service to add a couple of members to account 'ABC'. The first letter of the account is 'A', which always maps to Queue 1. Queue 1 has a single Sidekiq worker pulling from it, so when a message comes in to delete two members from the same account, the first letter of the account UUID is still 'A' and it lands on the same queue. It gets processed in sequence, because Queue 1 only has one worker pulling from it. Within an account, everything is sequential now.
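
Here is a minimal sketch of that routing idea using the pika client; the queue names, bucket count, and hash choice are illustrative assumptions rather than WeWork's real values:

```python
import hashlib
import pika

NUM_QUEUES = 8  # fixed and known in advance (illustrative)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

queue_names = [f"membership.bucket.{i}" for i in range(NUM_QUEUES)]  # hypothetical names
for name in queue_names:
    channel.queue_declare(queue=name, durable=True)


def queue_for_account(account_uuid: str) -> str:
    # A stable hash of the account UUID picks the bucket, so the same account always maps
    # to the same queue, and therefore to the same single worker.
    bucket = int(hashlib.sha256(account_uuid.encode()).hexdigest(), 16) % NUM_QUEUES
    return queue_names[bucket]


def republish(account_uuid: str, body: bytes):
    channel.basic_publish(
        exchange="",
        routing_key=queue_for_account(account_uuid),
        body=body,
    )
```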

Exclusive Consumer

This whole idea rests on being able to guarantee that there is only one worker per queue. To get that guarantee, we turned to the RabbitMQ exclusive consumer feature. We also thought about cases where workers could die, so to recover from that we created a background job that would regularly ask the broker how many consumers a given queue has, and if the answer was zero, we would spin up another consumer.

This is also a potential concurrency issue: what happens if, for whatever reason, we end up spinning up two consumers for the same queue? For exactly that reason we were really happy to use the exclusive consumer flag, so that RabbitMQ would promise us that could never occur.
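
A sketch of what one bucket's exclusive consumer and the consumer-count watchdog could look like; the queue name, credentials, and management-API URL are placeholders, and the watchdog would run as a separate process:

```python
import pika
import requests

QUEUE_NAME = "membership.bucket.0"  # one worker process per bucket queue (placeholder)


def run_worker():
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.basic_qos(prefetch_count=1)  # handle strictly one message at a time

    def handle(ch, method, properties, body):
        # ... apply the membership change carried by this message ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    # exclusive=True asks RabbitMQ to reject any second consumer on this queue,
    # which is the guarantee the whole design rests on.
    channel.basic_consume(queue=QUEUE_NAME, on_message_callback=handle, exclusive=True)
    channel.start_consuming()


def consumer_count(queue: str) -> int:
    # Watchdog side (a separate process): ask the management HTTP API how many consumers
    # the queue has; if it ever reports 0, spin up a replacement worker.
    resp = requests.get(
        f"http://localhost:15672/api/queues/%2F/{queue}",  # %2F is the default "/" vhost
        auth=("guest", "guest"),  # placeholder credentials
    )
    return resp.json()["consumers"]
```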

Before and After

So, the question that really matters is: did it work? I'm happy to report that it really did! This is a graph over time that we pulled from Datadog. The wavy line basically means that messages were not being processed successfully because accounts were in a bad state. The sharp line in the middle is our deployment, when we released our solution. Right afterwards, the accounts got way more stable and messages began to be processed successfully every time.

In fact, I think the reason there were still some misses is that we had to rectify accounts that were already messed up by the previous architecture. But after we did that, things were way more stable.

Things we wish we tried

Even though we solved the problem and everything is great, I don't think we did everything perfectly; we probably should have tried a few more things first. But like Avikam said, we were adamant that we wanted to solve it within our team and with the knowledge we had, with things we could maintain easily. However, apparently there's a RabbitMQ plugin, the consistent hash exchange, that basically does what we described. We didn't vet it, so we can't promise you that it solves the problem. But apparently it does!
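
For reference, here is a sketch, unvetted just like the plugin was in the talk, of how the consistent hash exchange (the rabbitmq_consistent_hash_exchange plugin, which must be enabled on the broker) might be wired up for similar per-account routing; queue names and the bucket count are illustrative:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Requires: rabbitmq-plugins enable rabbitmq_consistent_hash_exchange
channel.exchange_declare(exchange="membership.hash", exchange_type="x-consistent-hash")

for i in range(8):
    queue = f"membership.bucket.{i}"  # hypothetical names
    channel.queue_declare(queue=queue, durable=True)
    # With this exchange type the binding's routing key is a weight on the hash ring,
    # not a pattern; "1" gives every queue an equal share.
    channel.queue_bind(exchange="membership.hash", queue=queue, routing_key="1")

# Messages are routed by hashing the publish routing key, so using the account UUID as the
# routing key keeps each account pinned to one queue.
channel.basic_publish(exchange="membership.hash", routing_key="account-uuid-abc", body=b"add member")
```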

Also, our solution is a lot like Kafka, which partitions messages by a key in much the same way. We didn't really use Kafka much at WeWork, so I can't tell you whether it would have worked for us. We probably should have looked into these technologies a little before deciding to roll our own solution to the problem.

Key takeaways

So, what we really want you to take away from this talk is that less is more. Sometimes you don't need the most complicated solution to a problem, and if you can't describe your solution easily, chances are it's too complicated and you should scale it back a little. Even with a simple solution you can still use cool RabbitMQ features like exclusive consumers. We were really able to leverage technology we already knew how to use in order to solve our problem.

[Applause]
